The American Journal of Human Genetics — Latest Matching Preprints

1

Pitfalls in estimating and interpreting the contribution of ultra-rare genetic variants to the heritability of complex traits

Wang, H.; Wainschtein, P.; Sidorenko, J.; Fikere, M.; Zhang, Y.; Kemper, K. E.; Zheng, Z.; Hivert, V.; Zeng, J.; Goddard, M. E.; Visscher, P. M.; Yengo, L.

2026-04-07 genetic and genomic medicine 10.64898/2026.04.06.26350278 medRxiv

Top 0.1%

53.5%

Assessing the contribution of ultra-rare variants (minor allele frequency <0.01%) to the heritability of complex traits remains challenging due to limited understanding of potential biases. Here, we focus on singletons (that is, variants observed only once in the study sample), the most abundant class of ultra-rare variants, to showcase various confounders of heritability estimates and underline pitfalls in their interpretation. We show through theory, simulations, and analysis of 5,330,210 exome-sequenced singletons in 305,813 unrelated European-ancestry individuals in the UK Biobank that (i) population stratification induces both upward and downward biases in singleton-based heritability estimates, (ii) estimates capture non-additive genetic effects, and (iii) asymptotic standard errors of estimates from likelihood-based procedures are generally mis-calibrated when traits are not normally distributed. We further showcase these biases in real-data analyses of 22 quantitative phenotypes and report, after accounting for these pitfalls, significant estimates for number of children (3.4%), peak expiratory flow (1.9%), red blood cell count (2.5%), white blood cell count (1.9%) and heel bone mineral density (2.4%). Overall, our study provides recommendations for robust inference of heritability from ultra-rare variants and underscores that reliable estimates for ordinal and binary traits will require far larger sample sizes and improved methods, given that confounding in these traits remains difficult to detect and correct.

2

Modeling rare coding variation on chromosome X provides insight into the genetics and differential sex prevalence of autism spectrum disorder

Satterstrom, F. K.; Jodeiry, K.; Mahjani, B.; Hatem, G.; Park, S. J.; Klei, L.; Fu, J. M.; Wigdor, E. M.; the Autism Sequencing Consortium; Betancur, C.; Daly, M. J.; Roeder, K.; Devlin, B.; Buxbaum, J. D.; Cutler, D. J.

2026-05-07 genetic and genomic medicine 10.64898/2026.05.04.26352380 medRxiv

Top 0.1%

41.0%

Autism spectrum disorder (ASD) is estimated to be up to four times as common in males as in females, yet the causes of this prevalence difference are not well established. One possible driver is genetic variation on the X chromosome, as it contains genes capable of contributing to ASD (e.g., PTCHD1, MECP2) and is known to play a role in genetic disorders with differential sex prevalence (e.g., color blindness). However, a lack of power compared to the autosomes, combined with the complexities of modeling its biology, has led to the X being largely overlooked in sequencing studies. Here, we develop quantitative X-linked TADA, a new model designed specifically for application to this chromosome, and use it to analyze rare variation from 50,663 individuals with ASD (and 136,670 individuals total). We find 9 genes on the X associated with ASD at a false discovery rate (FDR) < 0.05 and an additional 9 genes at FDR < 0.2, with many of these previously identified as involved in specific neurodevelopmental disorders. Point estimates of the liability conferred by de novo variants on the X are similar in females and males, with both sexes' estimates elevated >20% above the corresponding autosomal values. We also develop a general theory of how X-linked variation of any additive or non-additive effect influences liability and describe its implications for prevalence. Using this theory and our empirical results, we show how genetic variation on the X could contribute to the sex-differential prevalence of ASD.

3

EA-PheWAS: Integrating Phenotype Embeddings with PheWAS for Enhanced Gene-Phenotype Discovery

Zheng, W.; Liu, T.; Xu, L.; Xie, Y.; Jing, Y.; Shao, H.; Zhao, H.

2026-04-22 genetics 10.64898/2026.04.21.720031 medRxiv

Top 0.1%

38.5%

Phenome-wide association studies (PheWAS) enable systematic exploration of relationships between genetic variants and clinical phenotypes derived from electronic health records (EHRs). Conventional regression-based PheWAS treats phenotypes separately and relies on binary phenotype representations, which limits statistical power for rare variants and rare phenotypes and reduces the ability to detect associations with phenotypes that are distributed across clinical codes. To address this limitation, we first developed EmbedPheScan, a phenotype embedding-based prioritization framework that summarizes the phenotypic profiles of rare loss-of-function variant carriers in a continuous embedding space. We then proposed EA-PheWAS by combining these embedding-derived signals with conventional regression-based PheWAS results using the aggregated Cauchy association test. Using the UK Biobank whole-exome sequencing and EHR data, we show that the proposed methods maintain appropriate false-positive control. We then performed genome-wide phenome scans across all genes and across biologically defined gene classes to evaluate EA-PheWAS relative to conventional PheWAS and EmbedPheScan, consistently finding that EA-PheWAS outperformed the other two methods. We illustrate the utility of EA-PheWAS focusing on four genes representing distinct scenarios, including strong-effect disease genes (PKD1, PKD2), genes with large numbers of rare LoF carriers (NF1), and genes with extremely sparse carrier counts (FBN1).
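For readers unfamiliar with the aggregated Cauchy association test named in this abstract, the following is a minimal sketch of the standard Cauchy combination rule for merging p-values. It is not the authors' implementation: the function name `acat`, the equal default weights, and the input check are assumptions made for illustration, and how EA-PheWAS weights the embedding-derived and regression-derived signals is not stated in the abstract.

```python
import math


def acat(pvalues, weights=None):
    """Combine p-values with the Cauchy combination rule (illustrative sketch).

    Each p-value is mapped to a standard Cauchy variate, the variates are
    averaged with the given weights, and the average is mapped back to a
    p-value through the Cauchy survival function.
    """
    if weights is None:
        # Assumption: equal weights when none are supplied.
        weights = [1.0] * len(pvalues)
    if any(not (0.0 < p < 1.0) for p in pvalues):
        raise ValueError("p-values must lie strictly between 0 and 1")
    total = sum(weights)
    # Weighted mean of Cauchy-transformed p-values.
    t = sum(w * math.tan((0.5 - p) * math.pi)
            for w, p in zip(weights, pvalues)) / total
    # Survival function of the standard Cauchy distribution at t.
    return 0.5 - math.atan(t) / math.pi


combined = acat([0.01, 0.5])  # one strong signal, one null-like signal
```

The combined p-value is dominated by the smallest inputs, which is why this rule is often used when only one of several tests is expected to carry signal.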

4

Bayesian Estimation of Mosaic Loss of Chromosome Y from Bulk RNA Sequencing Data

Lin, J.-R.; Zhang, Z.

2026-05-23 genomics 10.64898/2026.05.20.726153 medRxiv

Top 0.1%

34.8%

Mosaic loss of chromosome Y (LOY) is a common age-associated somatic alteration in men and is typically measured from DNA-based assays. Many cohorts, however, contain bulk RNA-seq data without matched DNA-based LOY measurements. We developed a Bayesian framework to estimate the fraction of cells with LOY from male bulk RNA-seq by modeling reduced Y-linked gene expression relative to expected expression after adjustment for age, expression covariates, and autosomal/X-linked control genes. In 377 male GTEx samples, individual Y-linked genes showed negative correlations with separately obtained DNA-based LOY measurements, supporting a shared Y-expression depletion signal. The primary fast empirical Bayes estimator achieved a Pearson correlation of 0.678 with measured LOY, a mean absolute error of 1.79%, a root mean squared error of 3.72%, and 95.2% empirical coverage of measured LOY. Performance was strongest for identifying large LOY events, with an AUC of 0.964 for measured LOY greater than 20%, while fine ranking among low-LOY samples remained uncertain. A mixture/PCA hierarchical Bayesian sensitivity model provided similar validation performance and interpretable posterior quantities but did not improve point estimation. Leave-one-Y-gene-out and prior-sensitivity analyses showed that the signal was distributed across multiple Y-linked transcripts and that prior shrinkage affected calibration. In an external whole-blood RNA-seq dataset without measured LOY, estimated LOY showed a modest age-related increase, but ex vivo immune stimulation shifted RNA-derived LOY estimates and reduced multiple Y-linked transcripts, indicating transcriptional confounding. These results show that bulk RNA-seq contains usable information about LOY, especially for larger events, but RNA-derived LOY should be interpreted as a probabilistic transcriptome-based estimate rather than a direct substitute for DNA-based mosaicism measurement.

5

Systematic assessment of machine learning-based variant annotation methods for rare variant association testing

Aguirre, M.; Irudayanathan, F. J.; Crow, M.; Hejase, H. A.; Menon, V. K.; Pendergrass, R. K.; McCarthy, M. I.; Fletez-Brant, K.

2026-03-20 bioinformatics 10.64898/2026.03.18.712715 medRxiv

Top 0.1%

34.7%

Machine learning-based annotation methods are increasingly used to assess the pathogenicity of genetic variants, but their performance at prioritizing variants for gene-level association testing remains poorly characterized. Here, we systematically benchmark five annotation methods -- CADD v1.6, CADD v1.7, AlphaMissense, ESM-1b, and GPN-MSA -- using four primary gene-based tests and six annotation-level aggregation tests across 14 quantitative traits measured in up to 350,377 UK Biobank participants. Using a novel framework based on Wasserstein distances, we quantify how annotation choice affects test calibration and power. Tests using CADD annotations achieve the highest signal separation, while tests using AlphaMissense annotations exhibit systematically lower calibration. All combinations of methods produced significant results that were enriched (1.8-5.8-fold) for loss-of-function intolerant genes, though tests using GPN-MSA annotations displayed the highest such enrichment. Replication across symmetric phenotypes and loss-of-function burden tests was generally similar across methods. Our analysis provides practical guidance for annotation method selection in rare variant studies and establishes a distributional framework for calibration assessment.
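The abstract does not describe how its Wasserstein-distance framework is constructed, so the sketch below only illustrates the general idea under stated assumptions: calibration is scored as the one-dimensional Wasserstein-1 distance between observed p-values and the quantiles expected under a uniform null. The function names and the choice of midpoint quantiles are hypothetical.

```python
def wasserstein_1d(xs, ys):
    """Wasserstein-1 distance between two equal-size 1D samples.

    For equal-size samples this is the mean absolute difference
    between the sorted values.
    """
    if len(xs) != len(ys):
        raise ValueError("samples must have the same size")
    pairs = zip(sorted(xs), sorted(ys))
    return sum(abs(a - b) for a, b in pairs) / len(xs)


def calibration_distance(pvalues):
    """Distance between observed p-values and uniform midpoint quantiles.

    Assumption: a well-calibrated null test yields roughly uniform
    p-values, so a distance near 0 suggests good calibration.
    """
    n = len(pvalues)
    uniform_quantiles = [(i + 0.5) / n for i in range(n)]
    return wasserstein_1d(pvalues, uniform_quantiles)


d = calibration_distance([0.02, 0.30, 0.55, 0.91])
```

A distributional score of this kind compares the whole p-value distribution rather than a single summary such as the genomic inflation factor.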

6

LANTERN: Leveraging Local Ancestry Tracts to Enhance Rare-Variant Aggregate Association Testing

Wang, Y.; Tuftin, B.; Raffield, L. M.; Hidalgo, B.; Kerns, S. L.; DeWan, A. T.; Leal, S. M.; Auer, P.

2026-04-27 genetic and genomic medicine 10.64898/2026.04.24.26351693 medRxiv

Top 0.1%

33.0%

Individuals with admixed ancestry comprise a significant proportion of populations of the Americas. Statistical methods have been developed to specifically leverage local ancestry inference to enhance the power and interpretability of genome-wide association studies in admixed populations. However, no such methods currently exist to test for rare-variant aggregate associations. Here we present LANTERN (Leveraging local ANcestry Tracts to Enhance Rare variaNt aggregate associations), a method that infers the alleles that lie on each ancestral haplotype and conducts rare-variant aggregate association testing in a generalized linear mixed model framework. Through simulation studies we demonstrated that LANTERN achieves proper control of Type 1 error while boosting power to detect associations when causal alleles predominately lie on one ancestral haplotype. Using data from a cohort of African American participants from the Jackson Heart Study, LANTERN identified two genes known to be involved in red-blood cell (RBC) biology when local ancestry information was incorporated. Specifically, a burden of rare alleles on European ancestral haplotypes in EPO was associated with both hemoglobin levels (HGB) and RBC counts, whereas a burden of rare alleles on African ancestral haplotypes in EPB42 was associated with HGB and RBC. In summary, LANTERN (i) allows for the identification of ancestry-specific rare-variant associations; and (ii) enhances rare-variant association signals compared to an analysis that ignores local ancestry. LANTERN is implemented in R and is freely available on GitHub.

7

The reliability and accuracy of recombination inferred by Shapeit2 duoHMM on whole genome sequence

Oubninte, S.; Ruczinski, I.; Yanek, L. R.; Mathias, R.; Bureau, A.

2026-05-10 genomics 10.64898/2026.05.06.723015 medRxiv

Top 0.1%

32.6%

Few studies have assessed the performance of population-based phasing combined with parental genotypes to infer recombination on whole genome sequence (WGS) data. In this study, our objective was to evaluate whether Shapeit2 duoHMM, a Hidden Markov Model using parental genotypes, infers recombination events reliably on WGS and with narrower intervals than SNP arrays. We based our analysis on the overlap between recombination events inferred by Merlin on SNP genotypes and Shapeit2 on WGS and SNP genotypes. We used a sample of 61 extended families from the GeneSTAR study with TOPMed freeze 8 WGS on 580 sequenced subjects (60% of sample). Shapeit2 was run with a window size of 500 kilobases and 200 states on WGS. To mimic a SNP array, we extracted genotypes of 355,112 autosomal markers on the Illumina OmniExpress array. The number of recombination events per meiosis inferred by Shapeit2 on the WGS data (36.8) was aligned with the expected numbers over autosomes (35.7), although Merlin overestimated this number (115.0). 73% of Shapeit2 recombination events on WGS were detected by Merlin, a proportion rising to 91% when restricting to events also inferred by Shapeit2 on OmniExpress genotypes. Furthermore, Shapeit2 recombination intervals were narrower on WGS than on OmniExpress genotypes (median of 4,530 bp vs. 49,458 bp). This suggests that Shapeit2 on WGS is a reliable and accurate method for inferring recombination events.

8

A General Statistical Framework for Hardy-Weinberg Equilibrium Inference on the X Chromosome

Zhang, L.; Paterson, A. D.; Sun, L.

2026-05-20 genetics 10.64898/2026.05.17.725730 medRxiv

Top 0.1%

32.2%

Testing for Hardy-Weinberg equilibrium (HWE) is a fundamental component of genetic data analysis, widely used for quality control and model validation. Although HWE testing is well established for autosomal loci, inference on the X chromosome is more complex due to sex-specific genotype structures and potential sex differences in minor allele frequency (sdMAF). Existing tests differ in their assumptions about sdMAF and male sample inclusion, often leading to distinct but poorly characterized null hypotheses. We develop a general statistical framework for HWE inference using the robust allele-based regression model. By formulating HWE testing as an assessment of allele-level dependence, the framework directly parameterizes Hardy-Weinberg disequilibrium, unifies existing Pearson χ²-based tests under explicit modeling assumptions, and clarifies their null hypotheses, degrees of freedom, and sensitivity to sdMAF. The framework also accommodates covariate and population-structure adjustment within a unified regression-based formulation. The proposed framework provides robust, interpretable, and flexible inference, establishing a unified statistical foundation for HWE testing across autosomal and X-chromosomal regions. Simulation studies and analysis of high-coverage 1000 Genomes Project data demonstrate that commonly used X-chromosome tests can exhibit inflated type I error or misleading inference when sdMAF is present.
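As background for the well-established autosomal case that this abstract contrasts with the X chromosome, here is a minimal sketch of the classical one-degree-of-freedom Pearson χ² test for HWE at a biallelic autosomal locus. It is not the allele-based regression framework the authors propose; the function name `hwe_chisq` and the genotype-count interface are assumptions for illustration, and the locus is assumed to be polymorphic (both alleles observed).

```python
import math


def hwe_chisq(n_aa, n_ab, n_bb):
    """Pearson chi-square test of HWE from autosomal genotype counts.

    Returns the test statistic and its p-value on 1 degree of freedom.
    Assumes both alleles are present, so all expected counts are positive.
    """
    n = n_aa + n_ab + n_bb
    p = (2 * n_aa + n_ab) / (2 * n)  # frequency of allele A
    q = 1.0 - p
    expected = (n * p * p, 2 * n * p * q, n * q * q)
    observed = (n_aa, n_ab, n_bb)
    stat = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    # Survival function of a chi-square with 1 df: erfc(sqrt(x / 2)).
    pvalue = math.erfc(math.sqrt(stat / 2.0))
    return stat, pvalue


stat, pvalue = hwe_chisq(30, 40, 30)  # fewer heterozygotes than expected
```

On the X chromosome, males contribute hemizygous genotypes, so this three-genotype table applies to females only; that gap is what motivates the sex-aware tests discussed in the abstract.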

9

A meta-analysis of chromatin-associated loci provides insights into mechanistic interpretations of trait heritability

Dudek, M. F.; Wenz, B. M.; Voight, B. F.; Almasy, L.; Grant, S. F. A.

2026-03-20 genetics 10.64898/2026.03.19.712994 medRxiv

Top 0.1%

31.9%

The vast majority of trait-associated loci discovered through genome-wide association studies (GWAS) are non-coding, yet most lack statistical alignment with any discovered expression quantitative trait loci (eQTLs). In particular, eQTLs are depleted at gene-distal regions and at "functionally important" genes - those with strong selective constraint and complex regulatory landscapes - likely due to selective depletion of high-effect variants. Here, we investigate the role of variants with weaker effects on expression transmitted through distal regulatory elements, which are detectable as chromatin accessibility QTLs (caQTLs). We aggregated caQTL data from ten studies derived across different tissues, cell types, and cell lines, representing 104,024 lead caQTLs across 3,457 samples. We found that, across a range of gene properties, caQTLs are discovered at functionally important genes more often than eQTLs. These observations are consistent with a model in which many eQTLs and GWAS hits are mediated through genetic effects on regulatory elements, which may have weak or context-dependent effects on gene expression. Our results suggest that caQTL discovery is more sensitive than eQTL discovery in capturing the molecular consequences of GWAS hits, and can provide complementary information to eQTLs by implicating functional mechanisms of additional disease-associated loci.

10

Optimizing phenotype scale improves genetic analyses in large-scale biobanks

Huang, Z.; Costantino, M.; Dahl, A.

2026-05-07 genetics 10.64898/2026.05.04.722531 medRxiv

Top 0.1%

28.7%

Large-scale biobanks have enabled increasingly complicated genetic analyses across thousands of phenotypes. However, studies rarely consider the appropriate phenotype measurement scale, a problem that can drastically affect inferences on genetic architecture. Here, we introduce SIQReg, a practical solution to this classical problem, which learns a data-driven phenotype scale by minimizing heterogeneity across phenotype quantiles. Applied to complex traits in UK Biobank, SIQReg rejects the default scale for 24/25 traits. Generally, SIQReg scales lie between default and logarithmic, indicating that default-scale traits are neither purely additive nor purely multiplicative. We show that SIQReg improves both non-additive and additive genetic analyses. SIQReg eliminates most non-additive genetic signals (such as 97% of vQTL and 76% of quantile-dependent TWAS genes), indicating they may be statistical artifacts, while preserving biologically plausible non-additive signals. Simultaneously, SIQReg improves power to detect additive signals, increasing GWAS loci, TWAS genes, and PGS prediction accuracy by 11%, 13%, and 10%, respectively, and identifies 50% more high-risk individuals. These gains replicate across ancestry groups. Our results establish SIQReg as a principled approach to phenotype scale transformation that improves genetic analyses of complex traits.

11

Widespread genetic effect heterogeneity impacts bias and power in nonlinear Mendelian randomization

Wang, J.; Morrison, J.

2026-04-20 epidemiology 10.64898/2026.04.17.26351133 medRxiv

Top 0.1%

28.3%

Mendelian randomization (MR) uses genetic variants as instrumental variables to infer causal relationships between complex traits. Standard MR can be used to estimate an average causal effect at the population level, and typically assumes a linear exposure-outcome relationship. Recently, several methods for estimating nonlinear effects have been developed. However, many have been found to produce spurious empirical findings when subjected to negative control analyses. We propose that this poor performance may be attributable to heterogeneity in variant-exposure associations. We demonstrate that heterogeneous genetic effects on exposure lead to biased estimates, poor coverage, and inflated type I error in control function and stratification-based methods. In contrast, two-stage least squares (TSLS) methods are robust to such heterogeneity, but suffer from low precision and low power in some circumstances. We show that a statistical test for heterogeneity can be used to guide the choice of nonlinear MR methods. Using UK Biobank data, we reassess the causal effects of BMI, vitamin D, and alcohol consumption on blood pressure, lipid, C-reactive protein, and age (negative control). We find strong evidence of heterogeneity for all three exposures, and also recapitulate previous results that control function and stratification-based methods are prone to false positives. Finally, using nonparametric TSLS, we identify evidence of nonlinear causal effects of BMI on HDL cholesterol, triglycerides, and C-reactive protein; however, specific estimates of the shape of these relationships are imprecise. Altogether, our results suggest that common nonlinear MR methods are unreliable in the presence of realistic levels of heterogeneity, and that more methodological development is required before practically useful nonlinear MR is feasible.

12

A Common Pathogenic Founder Variant in Rwandan Breast Cancer Cases

Manirakiza, A. V.; Baichoo, S.; Uwineza, A.; Dukundane, D.; Rugengamanzi, E.; Mutamuliza, J.; Niragira, A.; Muvunyi, R.; Besada, J.; Nielsen, S.; Bucknor, B.; Koeller, D. R.; Andrews, C.; Mutesa, L.; Fadelu, T.; Rebbeck, T. R.

2026-05-29 genetics 10.64898/2026.05.26.727861 medRxiv

Top 0.1%

27.3%

Germline data from African populations remain sparse, limiting characterization of population-specific BRCA1/2 pathogenic variants. In a study of 175 Rwandan women with breast cancer, 7 unrelated carriers (4% of cases; 22% of pathogenic variant carriers) harbored the same BRCA1 frameshift variant, c.4065_4068del (p.Asn1355Lysfs*10), which is extremely rare in gnomAD yet recurrent in European, Asian, and Middle Eastern cohorts. Whole-exome sequencing and haplotype analysis of all 7 carriers revealed a shared ancestral block of approximately 581 kb surrounding the variant, and extended haplotype homozygosity and network analyses confirmed a common founder origin. Coalescent-based age estimation placed the founder event approximately 4,000 to 10,000 years ago. Comparison with 1000 Genomes Project data showed the founder haplotype is absent or exceedingly rare outside African and South Asian populations. These findings strongly suggest that the c.4065_4068del variant is a prehistoric BRCA1 founder variant in Rwanda, with implications for targeted genetic testing, cascade screening, and cancer prevention in the region.

13

A biobank-scale method for learning modulators of gene-environment interaction underlying human complex traits from multiple environmental exposures

Liu, Z.; Ramteke, A.; Anand, A.; Gorla, A.; Jeong, M.; Sankararaman, S.

2026-03-16 genetics 10.64898/2026.03.13.711725 medRxiv

Top 0.1%

26.4%

It is increasingly recognized that genetic effects on complex traits and diseases are shaped by environmental context. Biobanks that measure diverse environmental exposures alongside genotypes and phenotypes at scale enable systematic study of gene-environment (GxE) interactions. Existing approaches, however, are limited in their ability to accurately model polygenic GxE involving many exposures across genome-wide genetic variants. It is unclear which exposure combinations are relevant for a given trait while distinguishing true interactions from environment-dependent heteroskedastic noise. To address these challenges, we develop Efficient multi-eNvironmental Gene-environment Interaction iNference Estimator (ENGINE), a supervised variance-component framework that learns an embedding that combines multiple environmental exposures while jointly estimating additive, GxE, and heteroskedastic noise components. To enable biobank-scale inference, ENGINE makes a single pass over the genotype matrix to cache genotype-dependent summaries, then assembles normal-equation components and gradients at each iteration. In simulations, ENGINE controls type I error rates, achieves high power, and accurately recovers the environmental embedding while remaining efficient at biobank scale. Applied to five complex traits paired with lifestyle exposures in N = 291,273 unrelated white British individuals and M = 454,207 common SNPs (MAF > 0.01) from the UK Biobank, ENGINE recovered GxE variance that was on average 1.4-fold larger than that captured by a single exposure and 5.5-fold larger than that captured by the first principal component of the exposures.

14

De novo EHMT2 variants cause an autosomal dominant EHMT2-related Kleefstra syndrome via loss of G9a methyltransferase activity.

Hnizda, A.; Martinez-Delgado, B.; Sanchez-Ponce, D.; Alonso, J.; Amiel, J.; Attie-Bitach, T.; Bada-Navarro, A.; Baladron, B.; Bermejo-Sanchez, E.; Brinsa, V.; Bukova, I.; Cazorla-Calleja, R.; Cervenkova, S.; Chow, S.; Dusek, P.; Fedosieieva, O.; Fernandez-Prieto, M.; Ghosh, S.; Gomez-Mariano, G.; Gregorova, A.; Hamilton, M. J.; Hartmannova, H.; Hernandez-San Miguel, E.; Herrero-Matesanz, M.; Hodanova, K.; Kadek, A.; Kerkhof, J.; Kleefstra, T.; Lacombe, D.; Levy, M. A.; Lopez-Martin, E.; Lyse, R.; Man, P.; Marin-Reina, P.; Macnamara, E. F.; McConkey, H.; Melenovska, P.; Mielu, L. M.; Moore, D.;

2026-04-20 genetics 10.1101/2025.09.25.678439 medRxiv

Top 0.1%

23.6%

The EHMT1 and EHMT2 genes encode human euchromatin histone lysine methyltransferases 1 and 2 (EHMT1 alias GLP; EHMT2 alias G9a), which form heteromeric GLP/G9a complexes with essential roles in epigenetic regulation of gene expression. While EHMT1 haploinsufficiency has been established as the cause of Kleefstra syndrome 1, the pathogenesis of G9a dysfunction in human disease remains largely unknown. We identified seven de novo EHMT2 variants in patients with clinical presentation, episignatures, histone modifications and transcriptomic profiles similar to those of Kleefstra syndrome 1. In vitro studies revealed that these variants encode structurally stable G9a proteins that are catalytically incompetent due to aberrant interactions either with the histone H3 tail or with S-adenosylmethionine. Heterozygous mice carrying a patient-derived variant exhibited growth retardation, facial/skull dysmorphia and aberrant behavior. Here we report pathogenic EHMT2 variants that likely exert a dominant-negative effect on GLP/G9a complexes and thus genocopy EHMT1 haploinsufficiency via a distinct molecular mechanism, defining an autosomal dominant EHMT2-related Kleefstra syndrome.

15

Functionality-Informed Fine-Mapping Dissects Common Variant Contributions to Coronary Artery Disease and Identifies Causal Variants and Pathways

Jacobsen, J. T.; Moller, P. L.; Rohde, P. D.

2026-04-02 genetic and genomic medicine 10.64898/2026.04.01.26349823 medRxiv

Top 0.1%

23.2%

Genomics offers a powerful approach to identify causal mechanisms underlying coronary artery disease (CAD) risk, with implications for pathogenesis, personalized prevention strategies, and therapeutic target discovery. Functionality-informed genome-wide fine-mapping was performed using the Bayesian framework SBayesRC to estimate genetic contributions of 6.9 million common variants, based on GWAS summary statistics from over one million individuals of European ancestry. Causal candidate genes were prioritized in a 5 kb flanking window within high-confidence local credible sets (LCSs). Their downstream biological influence was analyzed using protein-protein interaction networks and pathway enrichment analyses across three complementary dimensions: molecular, cellular, and disease level. Genetic modeling captured the highly polygenic architecture of CAD, estimating on average 34,000 variants to contribute to CAD risk, explaining 3.8% of total phenotypic variance. 36 high-confidence variants (PIP > 0.9) collectively explained 13.6% of genetic variance, while most variants demonstrated small individual effects but with substantial collective contributions. 17,150 variants were prioritized within 581 high-confidence LCSs, of which 195 were annotated to genes and 170 were implicated in downstream pathway analyses. The three most influential variants were mapped to PHACTR1, APOE, and LPL, explaining 2.49%, 1.59%, and 1.46% of genetic variance respectively. Pathway analyses revealed that genetic risk in CAD is driven by dysregulation of three interlinked biological processes: 1) lipoprotein function and cholesterol metabolism, 2) vascular homeostasis, and 3) cellular stress responses and inflammation. These findings advance the causal understanding of CAD pathogenesis, supporting the transition from association-based to functionality-informed genomic approaches in cardiovascular genetics.

16

Retrospective evaluation of human genetic evidence for clinical trial success using Mendelian randomization and machine learning

Ravarani, C. N. J.; Arend, M.; Baukmann, H. A.; Cope, J. L.; Lamparter, M. R. J.; Sullivan, J. K.; Fudim, R.; Bender, A.; Malarstig, A.; Schmidt, M. F.

2026-03-14 pharmacology and therapeutics 10.64898/2026.02.19.26346536 medRxiv

Top 0.1%

23.1%

Human genetics has become a cornerstone of drug target discovery, yet the value of Mendelian randomization (MR) for predicting clinical success remains uncertain. Here, we systematically evaluated MR across 11,482 target-indication pairs with documented Phase II clinical outcomes to assess its utility for drug development. We find that MR statistical significance alone does not enrich for Phase II success, in contrast to genome-wide association study (GWAS) support, which confers an increase in success probability. However, this apparent limitation reflects the heterogeneous nature of clinical failure and the fact that MR encodes information beyond P values. When MR-derived features, including instrument strength and explained variance, are integrated into machine learning models, predictive performance improves substantially. An MR-informed XGBoost classifier identifies target-indication pairs with a 55% overall approval rate, corresponding to a 6.4-fold enrichment over unstratified programs and a 2.8-fold improvement over GWAS-supported targets in Phase II. Notably, this enrichment is achieved without reliance on statistically significant MR results. Our findings demonstrate that MR is most informative when treated as a graded, context-dependent source of causal evidence rather than a binary hypothesis test, and that its integration with machine learning enables scalable, genetics-informed prioritization of drug targets across the clinical pipeline.

17

A drug repurposing screen reveals dopamine signaling as a candidate therapeutic pathway for PIGA-CDG

Aziz, M. C.; Wilson, J.; Chow, C. Y.

2026-04-18 genetics 10.64898/2026.04.17.719256 medRxiv

Top 0.1%

23.0%

PIGA-CDG is a congenital disorder of glycosylation caused by pathogenic partial loss-of-function variants in the PIGA gene. PIGA encodes an enzyme responsible for the catalytic transfer of N-acetylglucosamine to phosphatidylinositol during the first step of glycosylphosphatidylinositol anchor biosynthesis. Loss of this enzyme has a widespread phenotypic impact, but primarily results in neurological symptoms including seizures, intellectual disability, and developmental delay. Currently, treatments are limited and focus on symptom management. We developed an eye model of PIGA-CDG that has a reduced eye size. We screened a library of 98% 1,520 FDA/EMA-approved compounds to find drugs that improved the small eye phenotype. This screen revealed numerous drugs that improved eye size, including those that targeted dopamine signaling and cyclooxygenases. Using pharmacological and genetic approaches, we show that modulating dopamine signaling improves the eye size. Genetic inhibition of dopamine 2 receptor signaling and dopamine reuptake improve both the eye model and neurologically relevant PIGA-CDG phenotypes, including seizures and locomotor deficits. We also pharmacologically and genetically validate cyclooxygenase-targeting drugs in the eye model. These findings reveal novel biology underlying PIGA-CDG and point towards candidate therapeutic approaches.

AUTHOR SUMMARY

PIGA-CDG is a rare neurodevelopmental disorder caused by pathogenic variants in the gene PIGA. Patients primarily display neurological symptoms, including seizures, developmental delay, and intellectual disability. Fewer than 100 patients have been identified, and treatment strategies are limited. In the context of rare diseases, de novo drug development is difficult due to the high cost, lengthy development times, and a patient population that is often too small to conduct a clinical trial. Our lab leverages drug repurposing screening to circumvent many of the hurdles associated with de novo drug development.
Here, we develop and screen FDA- or EMA-approved compounds on a Drosophila model of PIGA-CDG, uncovering novel biology underlying PIGA-associated pathophysiology. We use pharmacological and genetic tools to demonstrate that modifying dopamine signaling and abundance, as well as cyclooxygenase-mediated pathways, contributes to PIGA-associated phenotypes. This work highlights promising therapeutic targets for PIGA-CDG.

18

Calibrated Prediction Intervals for Polygenic Scores: Updated Comparisons, Contextual Calibration, and Data Normalization

Chang, X.; Hou, S.; Zhou, X.

2026-05-19 genetic and genomic medicine 10.64898/2026.05.15.26353336 medRxiv

Top 0.1%

23.0%

Calibrated prediction intervals for polygenic scores (PGS) are essential for communicating individual-level uncertainty in genomic medicine. We present updated comparisons of two methods for constructing such intervals: CalPred, a parametric approach, and PredInterval, a non-parametric approach. Our results show that both methods can achieve calibrated coverage, although CalPred additionally requires a sufficiently large calibration set. The two methods also exhibit complementary trade-offs with respect to dataset size and risk identification. We further show that contextual calibration, as introduced in Hou et al. and followed in Shi et al., is most naturally achieved through appropriate phenotype normalization and data preprocessing. Apparent miscalibration can arise from inadequate normalization or from providing contextual information to some methods but not others. In UK Biobank, standard GWAS phenotype normalization procedures are sufficient to achieve contextual calibration for the traits analyzed. In the extreme simulations of Hou et al. and Shi et al., supplying contextual covariates to PredInterval restores contextual calibration without normalization, and appropriate normalization can achieve contextual calibration without supplying covariates, while also substantially improving upstream tasks including association power and PGS accuracy. Together, these results underscore the central role of phenotype normalization and data preprocessing in GWAS analyses, including reliable uncertainty quantification for PGS.

19

Benchmarking of local ancestry inference with different assays and parameters

Motegi, T.; Huang, F.; Campbell, J. D.

2026-05-21 genomics 10.64898/2026.05.18.726085 medRxiv

Top 0.1%

22.9%

Local ancestry inference (LAI) enables high-resolution characterization of chromosomal segments inherited from distinct ancestral populations, offering unique insights into genetic architecture in admixed cohorts. While LAI is commonly performed with high-coverage whole-genome sequencing (WGS), the performance of other genotyping assays and of lower sequencing depths has not been thoroughly benchmarked. In this study, we systematically evaluated the accuracy of LAI across SNP microarrays, whole-exome sequencing (WES), and ultra low-pass WGS (ULP-WGS) using diverse validation samples and state-of-the-art imputation pipelines. We show that ULP-WGS, when paired with GLIMPSE2, achieves robust accuracy at 0.25x coverage with a minimum genome window size of 0.5 centimorgans, with mean accuracy minus one standard deviation exceeding 95%. For WES, using "on-target" reads alone yields suboptimal performance, particularly for European and South Asian ancestries, with accuracies below 79.1% and 70.6%, respectively. However, incorporating "off-target" reads in WES and utilizing GLIMPSE2 substantially improved accuracy to ≥95% with a minimum window size of 0.2 centimorgans. We further evaluated formalin-fixed, paraffin-embedded (FFPE) samples and found that LAI could be performed successfully using WES data with accuracies of ≥95% at a minimum window size of 0.5 centimorgans. In contrast, SNP microarrays did not achieve high accuracy at any window size (≤95%). Together, these results demonstrate that LAI is achievable without conventional high-coverage WGS and establish optimal parameters for LAI across platforms.

20

Calibration of in-frame indel variant effect predictors for clinical variant classification

Abderrazzaq, H.; Singh, M.; Babb, L.; Bergquist, T.; Brenner, S. E.; Pejaver, V.; O'Donnell-Luria, A.; Radivojac, P.; ClinGen Computational Working Group; ClinGen Variant Classification Working Group

2026-04-18 bioinformatics 10.64898/2026.04.15.718599 medRxiv

Top 0.1%

22.8%

Insertions and deletions (indels) represent a substantial source of genetic variation in humans and are associated with a diverse array of functional consequences. Despite their prevalence and clinical importance, indels, particularly short in-frame indels, remain critically understudied compared to single nucleotide variants and are challenging to interpret clinically. While many computational predictors for missense variants have been rigorously evaluated and calibrated for clinical use, the clinical utility of tools for in-frame indels remains uncertain. To address this gap, we have calibrated in-frame indel prediction tools for clinical variant classification. We constructed a high-confidence dataset of in-frame indel variants (≤50 bp) from clinical and population databases and estimated the prior probability of pathogenicity for a rare in-frame indel observed in a disease-associated gene, as well as for insertions and deletions separately. Using a previously developed statistical framework based on local posterior probabilities, we then established score thresholds for eight computational tools, corresponding to distinct evidence levels for pathogenic and benign classification according to ACMG/AMP guidelines. All in-frame indel predictors evaluated here reached multiple evidence levels of pathogenicity and/or benignity, demonstrating measurable clinical value. However, these models consistently exhibited lower performance levels compared to missense predictors, highlighting the need for improved computational approaches for indel classification.